[Diffusion] [AMD] Online MXFP4 and FP8 Quantization for Multimodal Generation #21431
Co-authored-by: Bowen Bao <bowenbao@amd.com>
mickqian
left a comment
could you also mention this new server arg in cli.md, quantization.md or other related places?
Added documentation in |
…fig fix for zimage, and mxfp4 perf improvements
Definitely, I'm happy either way as long as the functionality lands! I originally didn't notice this PR before I had already created a new one. |
## Online Quantization

Online quantization applies quantization to unquantized models at load time. This is useful for when pre-quantized checkpoints are not available.
nit: add (on-the-fly / load-time quantization) as well
|
/tag-and-rerun-ci |
|
@ColinZ22 please fix lint checks |
|
Fixed, @mickqian @wisclmy0611 Re-review would be greatly appreciated! Hoping to land this PR soon. |
|
@amd-bot ci-status |
|
@ColinZ22 FP8 path is broken on main - FYI. |
|
@ColinZ22 Currently we see an error if we pass --enable-torch-compile true together with --quantization fp8. Any suggestions? |
|
Hi @yichiche, could you try adding It should fix the error and allow torch compile to bring significant performance improvements. (For the online MXFP4 quantization path another fix is needed to enable torch compile; SGLang Diffusion disables torch compile for all paths by default due to issues like this) |
…l-project#21431) Co-authored-by: Bowen Bao <bowenbao@amd.com> Co-authored-by: HAI <hixiao@gmail.com>
Motivation
Add online MXFP4 (for AMD GPUs) and FP8 quantization for multimodal (image and video) generation with models such as Z-Image-Turbo and Wan 2.2.
Modifications
- `--quantization` server argument, which loads an unquantized model and quantizes its weights and activations to MXFP4.
- `--quantization-ignored-layers` server argument, which skips selected layers during online quantization (keeping them in full precision).
- `Mxfp4Config` and `Mxfp4LinearMethod` classes, which use AITER dynamic MXFP4 quantization and MXFP4 GEMM kernels.
- `--quantization`.
Usage Example
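As a rough illustration of what `--quantization` and `--quantization-ignored-layers` do at load time, here is a minimal pure-Python sketch. It is not the AITER kernel path used by this PR: the function names, the glob-style matching of ignored layers, and the quantize-then-dequantize ("fake quantization") approach are all assumptions made for illustration. The block size of 32, the power-of-two shared scale, and the E2M1 value grid follow the OCP Microscaling (MX) format definition.

```python
import math
from fnmatch import fnmatch

# Magnitudes representable by one FP4 E2M1 element (OCP Microscaling format).
FP4_E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # number of elements that share one scale in MXFP4
E2M1_EMAX = 2     # exponent of the largest E2M1 value (6.0 = 1.5 * 2**2)


def fake_quantize_mxfp4_block(block):
    """Quantize one block (up to 32 floats) to MXFP4 and dequantize it again."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Shared power-of-two scale: align the block maximum with E2M1's top exponent.
    # (Clamping the scale to the E8M0 exponent range is omitted in this sketch.)
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for x in block:
        mag = min(abs(x) / scale, FP4_E2M1_VALUES[-1])  # saturate at 6.0
        # Nearest grid value; ties go to the smaller magnitude in this sketch.
        nearest = min(FP4_E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest * scale, x))
    return out


def should_quantize(layer_name, ignored_patterns):
    """Hypothetical filter: a layer is skipped if it matches any ignored pattern."""
    return not any(fnmatch(layer_name, p) for p in ignored_patterns)


def quantize_weights(weights, ignored_patterns=()):
    """weights maps a layer name to a flat list of floats (assumed layout)."""
    result = {}
    for name, values in weights.items():
        if not should_quantize(name, ignored_patterns):
            result[name] = list(values)  # kept in full precision
            continue
        quantized = []
        for start in range(0, len(values), BLOCK_SIZE):
            block = values[start:start + BLOCK_SIZE]
            quantized.extend(fake_quantize_mxfp4_block(block))
        result[name] = quantized
    return result
```

Calling `quantize_weights(weights, ignored_patterns=["final_layer.*"])` leaves any layer whose name matches the pattern untouched and snaps every other weight onto the MXFP4 grid of its block. How the real argument matches layer names may differ from the glob matching assumed here.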
To quantize a diffusion model online to FP8 or MXFP4, add the `--quantization` argument:
Generation Quality Comparison
Prompt 1: "A cat sitting at the top of a mountain looking down at a futuristic city"
Prompt 2: "A crowd of people of various age at a busy outdoor marketplace"
Prompt 3: "A young child blowing dandelion seeds, golden hour lighting"
Prompt 4: "A city street at sunset with snow-capped mountain in the distant background"
Performance Benchmarking
Model: Z-Image-Turbo
Dataset: 200 images from HuggingFace Parti-Prompts
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci